A Generic Architecture for the Conversion of Document Collections into Semantically Annotated Digital Archives

Authors

Josep Lladós

Dimosthenis Karatzas

Joan Mas Romeu

Gemma Sánchez

Abstract

Libraries and archives increasingly digitize document collections at scale and then process and semantically annotate them for preservation, browsing, navigation, and search. In this paper we propose a software architecture for converting high volumes of document collections into semantically annotated digital libraries. The proposed architecture recognizes two sources of knowledge in the conversion pipeline, namely document images and humans. The Image Analysis module and the Correction and Validation module cover the initial conversion stage. In the former, information is automatically extracted from document images. The latter involves human intervention at a technical level to define workflows and to validate the image processing results. The second stage, represented by the Knowledge Capture modules, requires information specific to the particular knowledge domain and generally calls for expert practitioners. These two principal conversion stages are coupled with a Knowledge Management module, which provides the means to organise the extracted and acquired knowledge. In terms of data propagation, the architecture follows a bottom-up process, starting with document image units, called terms, and progressively building meaningful concepts and their relationships. In the second part of the paper we describe a real scenario with historical document archives implemented according to the proposed architecture.
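The bottom-up flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, function names, and fields below (Term, Concept, KnowledgeBase, image_analysis, correct_and_validate, knowledge_capture, run_pipeline) are hypothetical, and the recognition step is replaced by a list of words passed in by the caller. The sketch only shows how terms extracted from images might be validated by an operator, mapped to concepts by a domain expert, and collected in one store.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A unit extracted from a document image (hypothetical fields)."""
    page_id: str
    text: str
    validated: bool = False


@dataclass
class Concept:
    """A domain-level concept built from one or more validated terms."""
    label: str
    terms: list = field(default_factory=list)


@dataclass
class KnowledgeBase:
    """Stand-in for the Knowledge Management module (assumed structure)."""
    terms: list = field(default_factory=list)
    concepts: dict = field(default_factory=dict)


def image_analysis(page_id, recognized_words):
    # Placeholder for automatic extraction: the recognition result is passed in
    # by the caller instead of being computed from pixels.
    return [Term(page_id, word) for word in recognized_words]


def correct_and_validate(terms, corrections):
    # A human operator supplies corrections; every reviewed term is marked validated.
    return [Term(t.page_id, corrections.get(t.text, t.text), True) for t in terms]


def knowledge_capture(terms, domain_labels, kb):
    # A domain expert supplies the mapping from term text to concept label.
    for t in terms:
        if not t.validated:
            continue  # unvalidated terms never reach the knowledge base
        kb.terms.append(t)
        label = domain_labels.get(t.text)
        if label is not None:
            kb.concepts.setdefault(label, Concept(label)).terms.append(t)
    return kb


def run_pipeline(page_id, recognized_words, corrections, domain_labels):
    # Bottom-up propagation: image units (terms) first, concepts last.
    kb = KnowledgeBase()
    terms = image_analysis(page_id, recognized_words)
    terms = correct_and_validate(terms, corrections)
    return knowledge_capture(terms, domain_labels, kb)
```

Calling run_pipeline with a page identifier, a list of recognized words, a dictionary of operator corrections, and a dictionary of expert-assigned labels returns a KnowledgeBase whose concepts are grouped by label. Keeping the two human inputs as separate arguments mirrors the abstract's distinction between technical validation and domain-specific knowledge capture.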

Similar resources

Semi-automated XML Tagging of Public Text Archives: a Case Study

Public archives contain large and continuously growing volumes of electronically available text documents. In many countries, public authorities are required by law to publish certain data to satisfy the information needs of the general public. In contrast to plain text documents, semantically tagged XML documents along with appropriate query languages largely facilitate searching and browsing ...

Topic Models for Semantically Annotated Document Collections

Increasingly, web document collections such as PubMed and DBPedia, but also social bookmarking systems, are annotated with semantic meta data. Given that the number of semantically annotated document collections is expected to increase in the near future, it is of interest to analyze if topic models might be able to play a larger role. Since most of the time, annotations are noisy and even huma...

Digital collections of semantically annotated cultural heritage texts

The creation of coherent archives to store the content of these documents in a meaningful way implies a totally different point of view, where the main focus is on the text and its meaning, rather than on the structure of its container (e.g. the tables of a database or the fields of a form). This document-centric approach provides a way of preserving the integrity of the original documents with...

SpeechFind: an experimental on-line spoken document retrieval system for historical audio archives

In this study, we present the SpeechFind system, an experimental on-line spoken document retrieval system for historical audio archives. As part of an on-going U.S. NSF Digital Library Initiative project, entitled the National Gallery of the Spoken Word (NGSW), SpeechFind is intended to serve as an audio index and search engine for spoken word collections spanning the 20th century with as much ...

Chronicles in Preservation: Preserving Digital News and Newspapers

Since the mid-1990s, libraries and archives have been digitizing newspapers for preservation and access. The standards used for this work have evolved significantly during this time. Modern collections employ digitization techniques, metadata extraction and standards, and file formats that are very different compared to early collections. Increasingly, libraries and archives also include born-d...

Journal:

J. UCS

Volume 14

Publication date: 2008
